language variation
Social Perceptions of English Spelling Variation on Twitter: A Comparative Analysis of Human and LLM Responses
Spelling variation (e.g. funnnn vs. fun) can influence the social perception of texts and their writers: we often have various associations with different forms of writing (is the text informal? does the writer seem young?). In this study, we focus on the social perception of spelling variation in online writing in English and study to what extent this perception is aligned between humans and large language models (LLMs). Building on sociolinguistic methodology, we compare LLM and human ratings on three key social attributes of spelling variation (formality, carefulness, age). We find generally strong correlations in the ratings between humans and LLMs. However, notable differences emerge when we analyze the distribution of ratings and when comparing between different types of spelling variation.
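Alignment between human and LLM ratings of this kind is typically quantified with a rank correlation such as Spearman's rho. A minimal self-contained sketch (the ratings below are illustrative toy data, not from the study):

```python
def rank(values):
    """Assign 1-based ranks, averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# hypothetical formality ratings (1-5 scale) for five texts
human = [3, 1, 4, 2, 5]
llm   = [3, 2, 4, 1, 5]
rho = spearman(human, llm)  # 0.9 for this toy data
```

A high rho indicates the two raters order the texts similarly even if their absolute scores differ, which is why distributional differences can persist despite strong correlations.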
Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants
Bengoetxea, Jaione, Gonzalez-Dios, Itziar, Agerri, Rodrigo
In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.
Tokenization is Sensitive to Language Variation
Wegmann, Anna, Nguyen, Dong, Jurgens, David
Variation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models using the popular Byte-Pair Encoding (BPE) algorithm to investigate how key algorithmic design choices impact downstream model performance: fitting corpus, pre-tokenizer and vocabulary size. We find that the best tokenizer differs between the two task types -- with the pre-tokenizer having the biggest impact on performance. Further, we introduce a new approach to estimate tokenizer impact on downstream LLM performance, showing significant improvement over techniques like Rényi efficiency. We encourage more work on language variation and its relation to tokenizers and thus LLM performance.
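The core intuition -- that BPE merges learned on one spelling convention can fragment another -- can be shown with a toy merge table. The merge list below is hypothetical (as if fit on American-spelling text), not taken from the paper:

```python
def bpe_tokenize(word, merges):
    """Greedily apply BPE merges, in priority order, to a character sequence."""
    tokens = list(word)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return tokens

# hypothetical merges learned from an American-spelling fitting corpus
merges = [("o", "r"), ("l", "or"), ("o", "lor"), ("c", "olor")]

american = bpe_tokenize("color", merges)   # -> ["color"]: one token
british  = bpe_tokenize("colour", merges)  # -> six single characters
```

The American spelling collapses to a single token while the British variant stays fully fragmented, illustrating how the fitting corpus shapes which varieties tokenize compactly.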
A Topic-aware Comparable Corpus of Chinese Variations
This study aims to fill the gap by constructing a topic-aware comparable corpus of Mainland Chinese Mandarin and Taiwanese Mandarin from social media in Mainland China and Taiwan, respectively. Using Dcard for Taiwanese Mandarin and Sina Weibo for Mainland Chinese Mandarin, we create a comparable corpus that is updated regularly and reflects modern language use on social media.
Linguistic Fingerprint in Transformer Models: How Language Variation Influences Parameter Selection in Irony Detection
Mastromattei, Michele, Zanzotto, Fabio Massimo
Sentiment analysis datasets, particularly those annotated on crowdsourcing platforms, may contain biases due to the lack of information about the cultural backgrounds of the annotators. This can lead to machine learning models trained on this data amplifying these biases, affecting how people perceive and label sentiment. Although these models can capture general sentiment, they often fail to capture the nuances experienced by different groups. This paper examines the impact of linguistic diversity on transformer models designed for irony detection. Using the EPIC corpus [1], we created five subsets tailored to different variations of English. We trained different transformer models and used the KEN pruning algorithm [2] to extract the minimum subset of optimal parameters that maintain the original performance of the model. We conducted this experimental process across five transformer architectures, revealing a minimum parameter overlap of 60% among resulting subnetworks. We then performed a comprehensive analysis to identify subnetworks with the highest and lowest similarity.
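The pairwise "parameter overlap" between pruned subnetworks can be measured by treating each subnetwork as a set of retained parameter indices; one simple choice is Jaccard overlap (the paper's exact measure may differ, and the index sets below are hypothetical):

```python
from itertools import combinations

def overlap(a, b):
    """Jaccard overlap between two sets of retained parameter indices."""
    return len(a & b) / len(a | b)

# hypothetical retained-parameter index sets for three variety-specific subnetworks
subnets = {
    "en-US": frozenset({0, 1, 2, 3, 4}),
    "en-GB": frozenset({0, 1, 2, 3, 7}),
    "en-AU": frozenset({0, 1, 2, 5, 6}),
}

min_overlap = min(overlap(a, b)
                  for a, b in combinations(subnets.values(), 2))
```

Reporting the minimum over all pairs, as the abstract does, gives a worst-case bound on how much the variety-specific subnetworks share.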
The Good Robot Podcast: featuring Su Lin Blodgett
Hosted by Eleanor Drage and Kerry Mackereth, The Good Robot is a podcast which explores the many complex intersections between gender, feminism and technology. In this episode, Microsoft Senior Researcher Su Lin Blodgett explores whether you can use AI to measure discrimination, why AI can never be de-biased, and how AI shows us that categories like gender and race are not as clear cut as we think they are. Su Lin is a senior researcher in the Fairness, Accountability, Transparency, and Ethics in AI (FATE) group at Microsoft Research Montréal. She is broadly interested in examining the social and ethical implications of natural language processing technologies, and develops approaches for anticipating, measuring, and mitigating harms arising from language technologies, focusing on the complexities of language and language technologies in their social contexts, and on supporting NLP practitioners in their ethical work. She has also worked on using NLP approaches to examine language variation and change (computational sociolinguistics), for example developing models to identify language variation on social media.
Regionalized models for Spanish language variations based on Twitter
Tellez, Eric S., Moctezuma, Daniela, Miranda, Sabino, Graff, Mario, Ruiz, Guillermo
Spanish is one of the most widely spoken languages in the world, but it is not written and spoken the same way in every country. Understanding local language variations can help improve model performance on regional tasks, both by capturing local structures and by better interpreting a message's content. For instance, consider a machine learning engineer who automates a language classification task for a particular region, or a social scientist trying to understand a regional event with echoes on social media; both can take advantage of dialect-based language models to understand what is happening with more contextual information and hence more precision. This manuscript presents and describes a set of regionalized resources for the Spanish language built on four years of public Twitter messages geotagged in 26 Spanish-speaking countries. We introduce word embeddings based on FastText, language models based on BERT, and per-region sample corpora. We also provide a broad comparison among regions covering lexical and semantic similarities, as well as examples of using regional resources on message classification tasks.
Efficient Test Time Adapter Ensembling for Low-resource Language Varieties
Wang, Xinyi, Tsvetkov, Yulia, Ruder, Sebastian, Neubig, Graham
Adapters are light-weight modules that allow parameter-efficient fine-tuning of pretrained models. Specialized language and task adapters have recently been proposed to facilitate cross-lingual transfer of multilingual pretrained models (Pfeiffer et al., 2020b). However, this approach requires training a separate language adapter for every language one wishes to support, which can be impractical for languages with limited data. An intuitive solution is to use a related language adapter for the new language variety, but we observe that this solution can lead to sub-optimal performance. In this paper, we aim to improve the robustness of language adapters to uncovered languages without training new adapters. We find that ensembling multiple existing language adapters makes the fine-tuned model significantly more robust to other language varieties not included in these adapters. Building upon this observation, we propose Entropy Minimized Ensemble of Adapters (EMEA), a method that optimizes the ensemble weights of the pretrained language adapters for each test sentence by minimizing the entropy of its predictions. Experiments on three diverse groups of language varieties show that our method leads to significant improvements on both named entity recognition and part-of-speech tagging across all languages.
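The entropy-minimization step of EMEA can be sketched in a few lines: given each adapter's label distribution for a test sentence, optimize softmax ensemble weights so the weighted mixture is as confident (low-entropy) as possible. This is an illustrative re-implementation with finite-difference gradient descent, not the authors' code:

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mixture(w, probs):
    """Weighted average of per-adapter label distributions."""
    return [sum(wi * p[k] for wi, p in zip(w, probs))
            for k in range(len(probs[0]))]

def emea_weights(probs, steps=300, lr=1.0, eps=1e-4):
    """Per-sentence ensemble weights minimizing prediction entropy."""
    def obj(logits):
        return entropy(mixture(softmax(logits), probs))
    logits = [0.0] * len(probs)
    for _ in range(steps):
        base = obj(logits)
        grad = [(obj(logits[:i] + [logits[i] + eps] + logits[i + 1:]) - base) / eps
                for i in range(len(logits))]
        logits = [v - lr * g for v, g in zip(logits, grad)]
    return softmax(logits)

# two adapters predicting over two labels: one confident, one uncertain
w = emea_weights([[0.9, 0.1], [0.5, 0.5]])
```

Minimizing entropy upweights the adapter that is confident on this particular sentence, which is why the weights adapt per test example rather than being fixed globally.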
RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems
Peng, Baolin, Li, Chunyuan, Zhang, Zhu, Zhu, Chenguang, Li, Jinchao, Gao, Jianfeng
For task-oriented dialog systems to be maximally useful, they must be able to process conversations in a way that is (1) generalizable with a small number of training examples for new task domains, and (2) robust to user input in various styles, modalities or domains. In pursuit of these goals, we introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains. By including tasks with limited training data, RADDLE is designed to favor and encourage models with a strong generalization ability. RADDLE also includes a diagnostic checklist that facilitates detailed robustness analysis in aspects such as language variations, speech errors, unseen entities, and out-of-domain utterances. We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain. Overall, existing models are less than satisfactory in robustness evaluation, which suggests opportunities for future improvement.
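A diagnostic checklist of this kind amounts to scoring a black-box model on systematically perturbed copies of an evaluation set. A minimal sketch with two toy perturbations (the word lists and checklist names are hypothetical, not RADDLE's actual diagnostics):

```python
import random

def vary_spelling(text):
    """Toy spelling-variation perturbation (hypothetical word list)."""
    swaps = {"colour": "color", "favourite": "favorite", "theatre": "theater"}
    return " ".join(swaps.get(w, w) for w in text.split())

def add_disfluency(text, rng):
    """Toy speech-error perturbation: insert a filler word at a random position."""
    words = text.split()
    words.insert(rng.randrange(len(words) + 1), "uh")
    return " ".join(words)

def run_checklist(model, utterances, rng=None):
    """Score a black-box model on each perturbed copy of the evaluation set."""
    rng = rng or random.Random(0)
    checklist = {
        "clean": lambda t: t,
        "spelling_variation": vary_spelling,
        "speech_error": lambda t: add_disfluency(t, rng),
    }
    return {name: sum(model(f(u)) for u in utterances) / len(utterances)
            for name, f in checklist.items()}

# dummy "model" whose score is just the utterance length, for illustration
scores = run_checklist(lambda t: float(len(t.split())),
                       ["book a table", "find a film"])
```

Comparing each perturbed score against the clean score localizes which robustness aspect a system fails on.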
Dialog Simulation with Realistic Variations for Training Goal-Oriented Conversational Systems
Lin, Chien-Wei, Auvray, Vincent, Elkind, Daniel, Biswas, Arijit, Fazel-Zarandi, Maryam, Belgamwar, Nehal, Chandra, Shubhra, Zhao, Matt, Metallinou, Angeliki, Chung, Tagyoung, Zhu, Charlie Shucheng, Adhikari, Suranjit, Hakkani-Tur, Dilek
Goal-oriented dialog systems enable users to complete specific goals like requesting information about a movie or booking a ticket. Typically the dialog system pipeline contains multiple ML models, including natural language understanding, state tracking and action prediction (policy learning). These models are trained through a combination of supervised or reinforcement learning methods and therefore require collection of labeled domain specific datasets. However, collecting annotated datasets with language and dialog-flow variations is expensive, time-consuming and scales poorly due to human involvement. In this paper, we propose an approach for automatically creating a large corpus of annotated dialogs from a few thoroughly annotated sample dialogs and the dialog schema. Our approach includes a novel goal-sampling technique for sampling plausible user goals and a dialog simulation technique that uses heuristic interplay between the user and the system (Alexa), where the user tries to achieve the sampled goal. We validate our approach by generating data and training three different downstream conversational ML models. We achieve 18–50% relative accuracy improvements on a held-out test set compared to a baseline dialog generation approach that only samples natural language and entity value variations from existing catalogs but does not generate any novel dialog flow variations. We also qualitatively establish that the proposed approach is better than the baseline. Moreover, several different conversational experiences have been built using this method, which enables customers to have a wide variety of conversations with Alexa.
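The goal-sampling plus heuristic-interplay idea can be sketched as follows: sample a complete user goal from a schema and catalog, then simulate the system requesting each slot while the simulated user informs it. The schema, catalog, and dialog-act format below are hypothetical placeholders, not the paper's actual ones:

```python
import random

SCHEMA = {"movie_tickets": ["title", "date", "num_tickets"]}  # hypothetical schema
CATALOG = {
    "title": ["Dune", "Up", "Heat"],
    "date": ["tonight", "Friday"],
    "num_tickets": ["2", "4"],
}

def sample_goal(domain, rng):
    """Sample a plausible user goal: one catalog value per schema slot."""
    return {slot: rng.choice(CATALOG[slot]) for slot in SCHEMA[domain]}

def simulate_dialog(goal):
    """Heuristic interplay: the system requests each unfilled slot,
    and the simulated user informs it, until the goal is achieved."""
    turns = []
    for slot, value in goal.items():
        turns.append(("system", f"request({slot})"))
        turns.append(("user", f"inform({slot}={value})"))
    turns.append(("system", "confirm(booking)"))
    return turns

rng = random.Random(0)
dialog = simulate_dialog(sample_goal("movie_tickets", rng))
```

Each simulated dialog comes out fully annotated with dialog acts for free, which is what makes the generated corpus usable as supervised training data.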